-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Don't execute the proxy queue while adding to it from same thread. #24565
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Still working on a more reliable test, but I wanted to try this out... |
9bf0082
to
fcffdb1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
Hopefully we can also figure out some way to improve the testing ?
fcffdb1
to
ff117cf
Compare
Solution LGTM, but I forgot about the testing. |
Changing to using a thread local causes failures in |
Also use `bool` rather than `int` to better convey intent. Split out from emscripten-core#24565
Also use `bool` rather than `int` to better convey intent. Split out from emscripten-core#24565
Also use `bool` rather than `int` to better convey intent. Split out from emscripten-core#24565
There was some suspicion that this might lead to a code size change but it doesn't seem to (I suppose because bool/int doesn't have any code in its constructor). Also use `bool` rather than `int` to better convey intent. Split out from #24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
TLS was not initialized on workers causing the tls_base address to be 0 and write to the wrong memory location. Fixes emscripten-core#21528 and is also needed for PR emscripten-core#24565
ff117cf
to
9ca233d
Compare
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
test_pthread_cancel was intermittently failing with a deadlock. From the backtrace, I found we were adding to the queue which can sometimes call malloc->emscripten_yield->emscripten_proxy_execute_queue which also tries to lock the same queue on the same thread.
9ca233d
to
a76f7e2
Compare
Unfortunately, I still haven't found a reliable way to reproduce this in a test case. Are we fine to land without? |
Does this test sketch sound plausible?
Maybe we can also hook the allocator to set flags as well and prove that the allocations happened at the expected times during the test. We do something similar when testing frees here: https://github.com/emscripten-core/emscripten/blob/main/test/pthread/test_pthread_proxying_refcount.c#L25-L28. If that doesn't sound plausible, then testing by observing fewer flakes SGTM. |
@tlively (I added some numbers to your steps above). Is the expectation that we could make step 4 deterministically fail without this PR? |
Yes, that's what I was aiming for. I haven't tried it, though :) |
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
This would really have helped in at least detecting the deadlock that was found in emscripten-core#24570 / emscripten-core#24565
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm % test comments
|
||
int main() { | ||
pthread_create(&looper, NULL, looper_main, NULL); | ||
emscripten_proxy_async(emscripten_proxy_get_system_queue(), looper, run_on_looper, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this ever set should_quit before looper_main actually starts running?
Can you add some comments somewhere in this file about exactly that case we are testing for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kind of assumed the thread start would be guaranteed to run before any proxy messages, but now I'm not sure. Even if it does run before, we'd still hit a deadlock since the main thread will be locked during task creation and trying to run the queue and lock again. It doesn't really matter what the other thread is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have looper_main
set a flag and have main
wait for that flag to be set before doing the proxying to be sure. +1 to adding comments, especially about how the deadlock would occur without the bug fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the test a bit more and removed the looper thread and just sends the task to the same thread. This makes the test a little less realistic, but highlights it doesn't really matter what the thread is doing.
// 3. Malloc then attempts to execute and lock the already-locked queue, | ||
// causing a deadlock. | ||
// This test ensures our implementation prevents this re-entrant lock. | ||
emscripten_proxy_async(emscripten_proxy_get_system_queue(), pthread_self(), task, NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nice, so you don't even need a second thread! Makes sense I guess since this is a double-lock/self-deadlock issue.
Is it valid to send work to yourself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_pthread_proxying.c has a test for proxing to itself, so I assume so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but only asynchronous self-proxying is allowed, since synchronous self-proxying would obviously deadlock.
test_pthread_cancel was intermittently failing with a deadlock. From the backtrace, I found we were adding to the queue which can sometimes call malloc->emscripten_yield->emscripten_proxy_execute_queue which also tries to lock the same queue on the same thread.